
A curated list of useful tools for data analysis.

  • pandas_profiling
  • sweetviz
  • resumetable
  • feature_transform (my library)

I will explore the transformations using the red wine quality dataset from Kaggle.

import pandas as pd
from scipy import stats

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
Mounted at /content/gdrive

root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'redwine'
csv_path = base_dir + '/winequality-red.csv'
df = pd.read_csv(csv_path)
# https://gist.github.com/harperfu6/5ea565ee23aaf8461a840c480490cd9a

pd.set_option("display.max_rows", 1000)

def resumetable(df):
    """Summarize dtypes, missing values, cardinality, sample values, and entropy."""
    print(f'Dataset Shape: {df.shape}')
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index().rename(columns={'index': 'Name'})
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    summary['Third Value'] = df.iloc[2].values

    # Shannon entropy (bits) of each column's value distribution
    for name in summary['Name']:
        summary.loc[summary['Name'] == name, 'Entropy'] = \
            round(stats.entropy(df[name].value_counts(normalize=True), base=2), 2)

    return summary
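The Entropy column is the Shannon entropy, in bits, of each column's value distribution. A quick standalone check of what that stats.entropy call returns (a minimal sketch on a toy column; the values are illustrative):

```python
import pandas as pd
from scipy import stats

quality_like = pd.Series([5, 5, 5, 6, 6, 7])  # toy "quality"-style column
probs = quality_like.value_counts(normalize=True)  # 0.5, 1/3, 1/6
entropy_bits = round(stats.entropy(probs, base=2), 2)
print(entropy_bits)  # 1.46
```

Low entropy flags near-constant columns; high entropy flags high-cardinality ones.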

Typically, the first thing to do is examine the first few rows of data, but that only gives a very rudimentary feel for the data.

df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

I found resumetable() to be very convenient. We get a sense of cardinality from Uniques, and we can easily see where we are missing data.

Also, knowing the datatype of each column is helpful when it comes to pre-processing the data.

I came across this function on Kaggle (I think) and found it incredibly helpful.

resumetable(df)
Dataset Shape: (1599, 12)
Name dtypes Missing Uniques First Value Second Value Third Value Entropy
0 fixed acidity float64 0 96 7.4000 7.8000 7.800 5.94
1 volatile acidity float64 0 143 0.7000 0.8800 0.760 6.39
2 citric acid float64 0 80 0.0000 0.0000 0.040 5.87
3 residual sugar float64 0 91 1.9000 2.6000 2.300 4.78
4 chlorides float64 0 153 0.0760 0.0980 0.092 6.22
5 free sulfur dioxide float64 0 60 11.0000 25.0000 15.000 5.08
6 total sulfur dioxide float64 0 144 34.0000 67.0000 54.000 6.60
7 density float64 0 436 0.9978 0.9968 0.997 7.96
8 pH float64 0 89 3.5100 3.2000 3.260 5.91
9 sulphates float64 0 96 0.5600 0.6800 0.650 5.73
10 alcohol float64 0 65 9.4000 9.8000 9.800 5.19
11 quality int64 0 6 5.0000 5.0000 5.000 1.71
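Since the summary surfaces the dtypes, one immediate use is splitting numeric from non-numeric columns before pre-processing. A minimal sketch on a toy frame (the region column is hypothetical, just to have a non-numeric dtype; the wine data itself is all numeric):

```python
import pandas as pd

toy = pd.DataFrame({
    "alcohol": [9.4, 9.8],   # float feature
    "quality": [5, 6],       # int target
    "region": ["A", "B"],    # hypothetical object column
})

# select only the numeric columns for downstream transforms
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['alcohol', 'quality']
```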

Another tool I use is pandas_profiling.

import sys

!"{sys.executable}" -m pip install -U "pandas-profiling[notebook]"
!jupyter nbextension enable --py widgetsnbextension
from ipywidgets import widgets

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="red wine", html={"style": {"full_width": True}}, sort="None")

It takes a couple of minutes to process and display the results, even with a small dataset.

You do get some richer analysis, though, such as correlation plots and distributions of the variables.

profile.to_widgets()
/usr/local/lib/python3.6/dist-packages/pandas_profiling/profile_report.py:424: UserWarning: Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60).As an alternative, you can use the HTML report. See the documentation for more information.
  "Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60)."

An alternative is Sweetviz. I tend to like it a bit better for its display of distributions, and in general it also loads a bit more quickly.

!pip -q install sweetviz
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_notebook(w=1200.)

My tool for numerical transformations

I wanted a simple way to view the distributions of the features and, more importantly, a way to view the data after numerical transformations such as Box-Cox or a log transform.
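The kind of transform I mean can be sketched with scipy on synthetic right-skewed data (a minimal sketch, not my library; the variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.75, size=1000)  # right-skewed sample

x_log = np.log1p(x)              # simple log transform
x_bc, lam = stats.boxcox(x)      # Box-Cox with lambda fit by MLE

# both transforms pull the skewness toward zero
print(stats.skew(x), stats.skew(x_log), stats.skew(x_bc))
```

Comparing the skewness before and after is a quick sanity check that the transform is doing its job.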

The following plot is a sample of what I developed.